Introduction

America is a highly diverse country. It is not only diverse in terms of ethnicity, but also in terms of income, industry, and law. This opens the doors for a variety of possible interactions between these variables. What factors drive the way that income is distributed in the United States? What factors reliably predict whether the average income per capita in a specific area is high or low? How does state level variations in law and freedom impact income?

How are these questions SMART?

These questions are important because they tell many facets of the story of consumption in the United States. Income serves both as a measure of productivity and lifetime consumption (although this analysis does not disentangle the two). Although their scope is broad, they remain specific to the concepts of income, demographics, and freedom, and maintain a consistent structure: how do demographics and freedom drive income in the United States, at the census tract level.

These questions also correspond to a set of highly measurable (And luckily, premeasured) variables. Income can be imputed from tax records, while ethnicity and work status are available from census forms. Achieving the answers to the questions is made simple by the cleanliness and availability of this data; since few data points are missing across all census tracts throughout the 50 states of interest, it is simple to form statistical tests.

Finally, these questions are relevant to policy makers who want to improve the incomes of their constituents as well as to researchers interested in establishing a baseline for the average income they should expect a community would earn based on its demographics. These are critical questions, because the ability of communities to support themselves economically has massive impacts on the wellbeing of their members.

Content

First, an examination is conducted on how the US Census Bureau database is structured, and which variables were included. Secondly, the groups of independent variables and how each of them could affect the income per capita of a community is presented. Then, an exploratory data analysis and some statistical tests are made to evaluate the significance of the variables. Later on, the models are executed, including clustering, classification, and regression. Finally, a conclusion looks into further challenges and questions necessary to enhance future analyses.

Dataset

U.S. Census Bureau Dataset

The U.S. Census Bureau Data holds the yearly American Community Survey: a project which asks Americans around the country about several dimensions of their lives, including work, income, demographics, and other activities (U.S. Census Bureau, 2019). The dataset from 2015 was available via Kaggle (MuonNeutrino, 2015), and included more than 74,000 observations, with 37 columns (variables). The dataset includes two variables related to income: the median household income and income per capita. The variable income per capita was prefered because it adjusts per person, and not per household given that it’s unknown how many people can live in an average household. The variable income per capita (IncomePerCap) is calculated as the average income per capita of the population of a specific census tract. But, what is a census tract and why use them?

Census tracts

Household’s income in America varies significantly by geographical location. The richest counties in the country are concentrated in urban areas near big metropolises where most businesses are located. The bay area in northern California, Northeast Virginia and New York are some examples. However, counties have been an insufficient unit to compare different variables among them. There are 3,142 counties in a country of 300 million inhabitants (U.S. Census Bureau, 2019), but among them are several inconsistencies. Texas, for example, has 254 counties (U.S. Census Bureau, 2017). California, a state with approximately 10 million people more than Texas, has only 58 counties (U.S. Census Bureau, 2017). Population-wise California has the largest county in the country with more than 10 million inhabitants (Los Angeles), whereas Texas has more than 80 counties with less than 10,000 people (U.S. Census Bureau, 2017). Density-wise, New York has 4 of 5 of the most dense counties in the country, some of them 60,000 times more dense than counties in Hawaii, Alaska or Nevada (U.S. Census Bureau, 2013). As a response to these inconsistencies found in counties in America, the U.S. Census Bureau delineated “Census Tracts” at the beginning of the twentieth century. A census tract is “geographic region defined for the purpose of taking a census.” Over the years, the U.S. Census Bureau has established census tracts in every county in America. There are over 74,000 census tracts in the country and a typical one has around 4,000 or so residents. There is a strength that comes from this consistency: census tracts are by and large similar in population size, and the population size of census tracts does not vary much from state to state.

Description of Variables

The complete dataset includes 17 independent variables and 1 dependent variable. Thanks to their nature, the independent variables were classified in three groups: Work Variation and Ethnic Variation.

Work Variation:

Professional: Percentage (%) employed in management, business, science, and arts in a census tract.

Service: Percentage (%) employed in service jobs in a census tract.

Office: Percentage (%) employed in sales and office jobs in a census tract.

Construction: Percentage (%) employed in natural resources, construction, and maintenance in a census tract.

Production: Percentage (%) employed in production, transportation, and material movement in a census tract.

Unemployed: Unemployment rate (%) in a census tract.

Self-employed: Percentage (%) self-employed in a census tract.

Ethnic Variation

Native: Percentage (%) of population that is Native American or Native Alaskan in a census tract.

White: Percentage (%) of population that is white in a census tract.

Black: Percentage (%) of population that is black in a census tract.

Hispanic: Percentage (%) of population that is Hispanic/Latino in a census tract.

Asian: Percentage (%) of population that is Asian in a census tract.

EDA

The exploratory data analysis of this dataset was divided into 4 sections:

  • Population of census tracts
  • Income per capita
  • Independent variables
  • Correlation matrix

Population of census tracts

Outliers identified: 3589 
Propotion (%) of outliers: 5.2 
Mean of the outliers: 74140.26 
Mean without removing outliers: 28491.23 
Mean if we remove outliers: 26139.73 
Outliers successfully removed 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      5    2929    4086    4369    5460   53812 

A baseline analysis of population and income was conducted. The histogram for population appeared skewed to the right. The different census tracts had similar population counts with a mean of about 4000. Counties were not evenly spread out as some had a population of 1 million and others 10 million. With similar populations, census tracts were easier to investigate instead of counties. The Q-Q plot confirmed the non-normality as the values between quartiles 3 and 4 were far away from the line.

Income per capita

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    128   18776   24730   26140   32247   56040 

Outliers and NA’s were removed from our dependent variable income per capita. Note that the histogram appears normal, tails of q-q are closer to the line, and mean and median are much closer together. Outliers were taken out of this variable because super wealthy individuals could have been exercising a significant pull on the distribution .

Independent variables

Work variation

Next the seven variables for work variations (professional, production, unemployment, office, service, construction, self-employed) were assessed for normality.

 Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   5.300   7.900   9.251  11.600 100.000     101 

The boxplots that exhibited a decrease in income, as more of the specific work variation was included in the census tract, were unemployment, service, construction, and production. That is to say, as more unemployed individuals were accounted for in a given census tract, the income per capita decreased.

 Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   23.70   31.70   33.23   41.80  100.00     105 

The only work variation that exhibited an increase in average income was professional work. Looking at the histograms of each of the variables it appeared that only the proportion of professionals was distributed normally. For professionals, the Q-Q plots affirmed the normality as the plot did not have the error terms straying far from the line with very small right and left tails. The same cannot be said for the other variables as each had an oversized right tail and a relatively small left tail. Overall the proportion of professionals appeared normally distributed while the other work variations did not.

The remaining variables of office and self-employed remained relatively stable across quartiles. The remaining six work variations were all skewed to the right.

Ethnic variation

Finally the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated.

 Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   37.10   70.30   61.24   88.40  100.00 

The boxplots for White showed an increase in average income between the first second and third quartiles but no change in the fourth.

 Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.100   1.200   4.347   4.400  91.300 

The boxplot for Asian showed an increase from the first through the fourth quartile.

 Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.40    7.20   17.36   21.50  100.00 

The boxplots for Hispanic slightly increased between the first and second quartile but did not change for the third quartile. The fourth quantile for Hispanic decreased significantly.

 Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.80    4.00   13.78   15.32  100.00 

The boxplot for Black increased in average income between the first and second quartile. Then there was a decrease in average income from the second to the fourth quartiles. Overall, it appeared that average income did change based on concentration of ethnicities in a census tract.

The histogram for White was bimodal with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histogram, it appeared that white had the highest responses followed by Hispanic, Black, Asian, and Native.

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.7567   0.4000 100.0000 

Also, there were not enough responses from the Native ethnicity to construct a meaningful boxplot. Therefore, based on the assessment of the boxplots, histograms, and Q-Q plots, none of the ethnicities appear normally distributed.

Correlation Matrix

Models: Why perform both clustering and prediction?

Throughout this report, both clustering models and prediction models were performed. Clustering models, such as PCA and K-means, are unsupervised learning methods. The objective is to find patterns in the data that help understand the relationship and closeness of different variables, and consequently subgrouping the dataset based on those patterns. Overall, It is important to use supervised and unserpervised models as it helps create a better holistic understanding of the data. In this study, the initial purpose was to understand income per capita. However, there are multiple independent variables that, according to this EDA, could be heavily correlated. Therefore, a clustering analysis could help us understand these relationship and make better conclusions about the reality of income in the U.S.

On the other hand, prediction models, such as KNN classification and regressions, are supervised learning whose main purpose is to understand the behaviour of a dependent variable. As shown in the EDA, the analysis was includes 10 independent variables and a single dependent variable: income per capita. Then, running classification and regression models was indispensable in this analysis.

PCA

A Principle Component Analysis (PCA) and Principle Component Regression (PCR) seemed suited to this dataset. The purpose of this technique is to decrease the number of variables while accounting for collinearity. Within this dataset there were 10 variables to explain IncomePerCap. However, the correlation matrix showed notable correlation between some of the predictor variables. For example, Professional had notable correlations with Service, Construction, Production and Unemployment, White had notable correlations with Hispanic and Black, etc. From this inital overview of the correlation matrix, the prospect of PCA seemed suitable and was continued.

The biplot above analyzed over 70k+ data points, resulting in the dense scattering of data. The axes of this plot were PC1 on the horizontal and PC2 on the vertical. PC1 had the most variation, between approximately -5 to 10, while PC2 went between -8 to 10. The variables White, Production, Unemployment, Black and Professional were pretty evenly split up between Pc1 and PC2. Other variables, such as Office, Service, Unemployed, Construction, etc. were majorly represented in either PC1 or PC2.

Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion  0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
                           PC8     PC9     PC10
Standard deviation     0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion  0.99849 1.00000 1.000000

The breakdown of the variation explained by each component showed that just over 60% of the variation was accounted for within the first three components. However, except for the first component, the change in the amount of variation explained in each consecutive component was similar. This was further illustrated by the following graph.


Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3154   -136   3093  39355 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 26167.82      20.93 1250.463  < 2e-16 ***
PC1         -4585.05      11.71 -391.701  < 2e-16 ***
PC2         -1454.29      15.63  -93.043  < 2e-16 ***
PC3           604.54      17.96   33.664  < 2e-16 ***
PC4           994.55      20.21   49.214  < 2e-16 ***
PC5          -878.20      23.56  -37.274  < 2e-16 ***
PC6          1377.18      25.44   54.140  < 2e-16 ***
PC7          -205.74      27.20   -7.564 3.96e-14 ***
PC8          -196.99      29.35   -6.712 1.93e-11 ***
PC9          3301.06     170.10   19.407  < 2e-16 ***
PC10        -3519.28    6262.21   -0.562    0.574    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.704e+04 on 10 and 69556 DF,  p-value: < 2.2e-16

A full principle component regression was performed, and all except the last component were deemed significant. It was also notable that this regression explained 71.01% of the variance in the dataset according the the adjusted R-Squared. The strongest variable was of course PC1 with a t-value with a magnitude by far larger than the rest of the variables.

A plot of the R-Square values over the number of components explained the amount of variation explained in the independent variable, IncomePerCap, based off of the components. The steeper increase and then petering off that occured in the R-Square graph seemed to indicate that a significant amount of the variation of the data in regards to IncomePerCap was explained using just the first component. Based on the initial analysis of the R-Square graph, and the results of the regression it seemed appropriate to run a regression on just PC1 which resulted in a lower Adjusted R Square.


Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)

Residuals:
   Min     1Q Median     3Q    Max 
-42965  -4026   -442   3546  36606 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26167.82      23.34  1121.0   <2e-16 ***
PC1         -4585.05      13.06  -351.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared:  0.6393,    Adjusted R-squared:  0.6393 
F-statistic: 1.233e+05 on 1 and 69565 DF,  p-value: < 2.2e-16

The results of this regression on just PC1 corroborated that the R-Square is smaller, at a value of 63.93%. The tradeoff between parsimony and description of these two potential models made the choice of model unclear. Assuming the more explanatory model, which accounted for the number of components included by adjusted R Square, was chosen, only one component would be removed. This was not an effective parsing down of variables. However, there would be low bias since only one component was dropped.

K- Means

List of 9
 $ cluster     : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:2] 1.15e+12 1.34e+12
 $ tot.withinss: num 2.5e+12
 $ betweenss   : num 4.81e+12
 $ size        : int [1:2] 24086 45481
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481

Cluster means:
    Hispanic      White      Black      Asian Professional    Service
1 -0.3285506  0.4200874 -0.3199683  0.2372900    0.9282858 -0.6150829
2  0.1739950 -0.2224715  0.1694500 -0.1256649   -0.4916051  0.3257379
       Office Construction Production Unemployment IncomePerCap
1 -0.01890396   -0.4302842 -0.6556146   -0.5268802     37598.33
2  0.01001123    0.2278715  0.3472029    0.2790272     20114.40

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  1  2  1  2  2  2  2  2  2  2  1  2  2  1  1  2  2  1  2  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  2  1  1  1  1  1  2  1  1  2  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
 (between_SS / total_SS =  65.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:3] 4.33e+11 3.94e+11 3.93e+11
 $ tot.withinss: num 1.22e+12
 $ betweenss   : num 6.09e+12
 $ size        : int [1:3] 27175 29638 12754
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754

Cluster means:
    Hispanic      White      Black       Asian Professional    Service
1  0.4318377 -0.5751979  0.3952287 -0.15378102   -0.7689991  0.6127621
2 -0.2397814  0.3399978 -0.2076592 -0.02410908    0.1329131 -0.2111302
3 -0.3629095  0.4354829 -0.3595529  0.38368702    1.3296436 -0.8149861
       Office Construction  Production Unemployment IncomePerCap
1 -0.02605785    0.2974405  0.51204618    0.6194822     16576.79
2  0.07327428    0.0108065 -0.07894429   -0.3101802     27821.96
3 -0.11475466   -0.6588701 -0.90756657   -0.5991303     42759.54

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  1  1  2  2  2  1  2  2  2  2  1  1  1  2  2  2  1  3  3  2  2  2  1  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 1  1  2  2  3  2  3  1  3  3  2  2  2  2  1  1  2  1  1  1  1  1  1  1  1  1 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 1  1  2  1  1  1  1  1  1  1  1  2  1  1  1  1  1  2  1  1  1  1  1 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
 (between_SS / total_SS =  83.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
 $ tot.withinss: num 7.15e+11
 $ betweenss   : num 6.6e+12
 $ size        : int [1:4] 17574 17698 8266 26029
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029

Cluster means:
    Hispanic      White       Black      Asian Professional     Service
1 -0.3016974  0.4066104 -0.28173021  0.1074823    0.5711741 -0.43512866
2  0.6680011 -0.8809881  0.57590840 -0.1651807   -0.9293827  0.83254425
3 -0.3738848  0.4389987 -0.37976189  0.4559152    1.5331982 -0.92442949
4 -0.1317654  0.1850702 -0.08076331 -0.1050413   -0.2406168  0.02128076
       Office Construction Production Unemployment IncomePerCap
1  0.06629532   -0.2290380 -0.4319137  -0.46098899     32840.09
2 -0.05484069    0.3137655  0.5749681   0.89883620     14433.73
3 -0.17952039   -0.7693828 -1.0184110  -0.62970642     45780.77
4  0.04953752    0.1856318  0.2240905  -0.09992813     23412.84

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 4  2  4  4  4  1  4  1  4  4  4  4  4  2  4  1  4  2  1  1  4  4  1  2  4  4 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 2  4  1  1  3  1  1  4  1  3  4  1  1  4  4  4  4  4  2  2  2  2  2  4  4  2 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 2  4  4  2  4  4  4  2  2  2  4  4  4  2  2  4  4  4  2  4  2  4  4 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
 (between_SS / total_SS =  90.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

List of 9
 $ cluster     : Named int [1:69567] 1 8 10 1 5 5 10 6 1 1 ...
  ..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
 $ centers     : num [1:10, 1:11] -0.219 1.288 -0.361 -0.401 -0.269 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
 $ totss       : num 7.31e+12
 $ withinss    : num [1:10] 1.12e+10 1.74e+10 1.39e+10 1.27e+10 1.15e+10 ...
 $ tot.withinss: num 1.28e+11
 $ betweenss   : num 7.18e+12
 $ size        : int [1:10] 10068 3865 3958 2488 8688 6934 7565 9957 5277 10767
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 10068, 3865, 3958, 2488, 8688, 6934, 7565, 9957, 5277, 10767

Cluster means:
      Hispanic      White       Black        Asian Professional     Service
1  -0.21894266  0.3228535 -0.18234201 -0.087339706  -0.06461379 -0.11255017
2   1.28797955 -1.3954285  0.66203610 -0.227681863  -1.10820404  1.22557254
3  -0.36067127  0.4329611 -0.37594612  0.433931771   1.47068572 -0.88529754
4  -0.40143948  0.4574908 -0.41624907  0.552865396   1.86667764 -1.11560010
5  -0.26857421  0.3814098 -0.24191694  0.001449229   0.27733630 -0.29947970
6  -0.31091712  0.4238401 -0.30526537  0.133107865   0.63900706 -0.46799362
         Office Construction  Production Unemployment IncomePerCap
1   0.066108550   0.12256875  0.08790777  -0.24792612    25611.181
2  -0.076791743   0.32434445  0.48520322   1.65358131     9524.098
3  -0.150092261  -0.74228280 -0.99238368  -0.63110451    44529.439
4  -0.323599903  -0.93488726 -1.16996086  -0.66198939    51688.147
5   0.094878659  -0.05916413 -0.20555508  -0.38831757    29399.124
6   0.075622849  -0.27384931 -0.49047224  -0.48675137    33668.105
 [ reached getOption("max.print") -- omitted 4 rows ]

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 1  8 10  1  5  5 10  6  1  1 10 10  8  8  1  5 10  7  6  9  1  1  5  8 10 10 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 
 8  8  5  5  9  6  6 10  6  3 10  5  6 10  8  8  1  8  7  8  7  8  7  8 10  8 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 
 8  8 10  7  8 10 10  8  8  8  8  1  8  7  8 10  8  1  8 10  7  8 10 
 [ reached getOption("max.print") -- omitted 69492 entries ]

Within cluster sum of squares by cluster:
 [1] 11229084259 17370175686 13910959117 12674139695 11527257287 11975510243
 [7] 11954310598 12586251071 12940548385 11846270069
 (between_SS / total_SS =  98.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

K-means is an unsupervised learning algorithm. The goal of this program is to find groups or clusters of data in order to identify certain patterns. All of the values in the data set were normalized to make comparisons of the overall dataset on a similar scale. K-means was used for 2,3,4, and 10 clusters. On inspection of the clusters created from k=2, The cluster that had the highest IncomePerCap at 37598 had the highest cluster mean of professional at 0.928, White at 0.420 and Asian at 0.237. the cluster plot chart has all the 70,000 datapoints in green and the two different clusters in blue and red respectively.It appears that there is overlap of the clusters however this occurs as the plot takes all the different data points and plots them on a two dimensional graph. With only two clusters it captures about 65.8% of the cluster sum of squares. Further inspection was constructed for a model with k =3. The cluster with the highest IncomePerCap was found to be cluster three at 42760. this cluster also had the highest cluster mean for Professional at 1.330 and Asian at 0.3837. The first cluster which had a IncomePerCap cluster mean of 16577 had the highest uneployment cluster average at 0.619. the cluster plot has three distinct clusters portrayed and the overlap makes it a little difficult to see which cluster is which. With only three clusters, 83.3% of the data is captured which is a drastric improvement from only two clusters. A final analysis was constructed for a model with k=4. The cluster with the highest IncomePerCap was found to be cluster three with 45781. this cluster had the hgihest Professional cluster average at 1.533 and the highest Asian cluster averge at 0.456. The cluster with the lowest IncomePerCap was cluster two at 14434. It had the highest unemployment cluster average at 0.8988. The cluster plot is difficult to interpret as the all of the datapoints were brought to a two dimensional scale and now there are four different clusters. With only four clusters, 90.2% of the data is captured which is a drastric improvement from only two clusters. As the clusters increased from 4 to 10, the percentage captured did not increase drastically. For example when k= 10, 98.2% of the data is captured. Therefore, a cluster of four would be sufficient

KNN

Preprocessing KNN

Besides prediction, classification was another method that was performed in order to understand income per capita of census tracts. Is it possible to classify future census tract’s income per capita based on the information given related to work and ethnic variations? To answer this question the dependent variable was transformed into a classifiable variable. In other words, income per capita, a numerical variable, was splitted into four groups: low, medium low, medium high, and high income per capita.

The function quant_groups was used. This separated the variable “income per capita” in quartiles and created a categorical variable with 4 categories. Each category has the same amount of observations (17418), and contains all census tracts ranging from the income per capita levels of each quartile. The new categorical variable was named “ipc4” and included:

  • Low: 0 - 18,000
  • Mid-Low: 18,800 - 24,700
  • Mid-High: 24,700 - 32,200
  • High: 32,000 - 56,000

Separately, another variable was created from “income per capita” with only 2 categories. The purpose is to compare how classifying income per capita using 4 categories differs from classifying it using 2 categories. Therefore, a second categorical variable was created and named “ipc2” containing:

  • Low: 0 - 24,700
  • High: 24,700 - 56,000

Both new categorical variables were added to the dataset “dat_1”.

'data.frame':   69567 obs. of  12 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ ipc2        : Factor w/ 2 levels "Low","High": 2 1 1 1 2 2 1 2 1 1 ...
 $ ipc4        : Factor w/ 4 levels "Low","Mid-Low",..: 3 1 2 2 3 3 2 4 2 2 ...

The dataset that was used to perform the KNN contained “ipc2” and “ipc4” along with the independent variables used to classify income per capita.

Train-Test split 3:1

First, the data was splitted into 80% training, and 20% test subsets. Two test sets were created: one for the categorical variable with 2 categories (ipc2) and another one for the categorical variable with 4 categories (ipc4).

KNN 2 categories

Selecting the correct “k”

The first KNN model performed tried to classify census tract’s income in two categories: Low or High. First, the chooseK() function was applied to determine the best KNN K value for the model.

 num [1:2, 1:15] 1 0.204 3 0.177 5 ...

The graph shows that 9 is approximately the best value for k because it provides the highest accuracy. In other words, the optimal amount of neighbors used to classify each observation is 9, because the accuracy does not improve substantially after adding more neighbors.

Results

             dat_ipc2.testLabels
dat_pred_ipc2  Low High
         High 1940 9719
         Low  9492 1685
[1] 0.1587406

As shown in the confusion matrix, the model can classify the income level of a census tract based on their work and ethnic demographics with an accuracy of 84%. For the test set, that corresponded to 19,210 observations predicted successfully out of 22,836. However, this high accuracy score corresponded only to the classification of two categories: low and high.

The reality of income in the U.S. is much more complex than that. Therefore, it makes sense to perform another model trying to classify 4 categories instead.

KNN 4 categories

Selecting the correct “k”

The next KNN model tried to classify census tract’s income in four categories: Low, Mid-low, Mid-High and High.

 num [1:2, 1:15] 1 0.563 3 0.597 5 ...

The chooseK() function applied to determine the best KNN K value for the model shows that 9 is approximately the best value for k as well, because it provides the highest accuracy.

Results

             dat_ipc4.testLabels
dat_pred_ipc4  Low Mid-Low Mid-High High
     Low      4120     982      165   19
     Mid-Low  1299    2996     1422  169
     Mid-High  238    1510     2923 1089
     High       66     221     1223 4394
[1] 0.6320284

As shown in the confusion matrix, the accuracy scored decreased from 84% to 63%. That means that is harder to classify income per capita when having 4 categories than having 2 categories. For the test set, that corresponded to 14,466 observations predicted successfully out of 22,836.

In reality, income per capita is a numerical variable. So the classification exercise served only as a proxy to understand how census tracts can be classified based on their demographics, but is not by any means a completely reliable tool to predict income in the U.S.

Linear Regression

'data.frame':   69672 obs. of  11 variables:
 $ Hispanic    : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
 $ White       : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
 $ Black       : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
 $ Asian       : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
 $ Professional: num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
 $ Service     : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
 $ Office      : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
 $ Construction: num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
 $ Production  : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
 $ Unemployment: num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
 $ IncomePerCap: int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...

#Basic Linear Regression We begin our predictive modeling with a simple linear regression model. This will allow us to make basic statements about how the variables of interest correlate with income, but won’t be an appropriate final model, since it does not allow for interactions between components which we have already shown to be inter-related. We begin by doing an exhaustive search of our variables of interest.

The exhaustive search indicates, by adjusted R^2, that we should include all of our variables except Professional. Since we perceive few costs to expanding the model, we simply take the model with the highest adjusted value. Note that Professional is excluded not because it is individually unhelpful, but because the set of work types sums to one, so one of them will always be dropped. We also did feature selection with BIC and Cp, with similar results. For brevity, we do not include them.


Call:
lm(formula = IncomePerCap ~ . - Professional, data = datDisc)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3154   -136   3093  39362 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  51988.355    396.421  131.14   <2e-16 ***
Hispanic        41.671      3.749   11.11   <2e-16 ***
White           96.026      3.777   25.42   <2e-16 ***
Black           53.109      3.803   13.97   <2e-16 ***
Asian          128.931      4.808   26.82   <2e-16 ***
Service       -536.921      3.174 -169.15   <2e-16 ***
Office        -383.300      3.848  -99.62   <2e-16 ***
Construction  -444.860      4.003 -111.12   <2e-16 ***
Production    -533.878      3.054 -174.82   <2e-16 ***
Unemployment  -270.607      4.495  -60.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69557 degrees of freedom
  (105 observations deleted due to missingness)
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.894e+04 on 9 and 69557 DF,  p-value: < 2.2e-16

Call:
lm(formula = IncomePerCap ~ . - Professional, data = datDisc)

Coefficients:
 (Intercept)      Hispanic         White         Black         Asian  
    51988.36         41.67         96.03         53.11        128.93  
     Service        Office  Construction    Production  Unemployment  
     -536.92       -383.30       -444.86       -533.88       -270.61  
    Hispanic        White        Black        Asian      Service       Office 
   17.508348    31.446591    16.187328     3.948750     1.475878     1.153010 
Construction   Production Unemployment 
    1.291156     1.195779     1.621923 
 (Intercept)     Hispanic        White        Black        Asian      Service 
 51988.35533     41.67122     96.02584     53.10864    128.93092   -536.92067 
      Office Construction   Production Unemployment 
  -383.29973   -444.85965   -533.87800   -270.60749 

Taking a simple linear regression, we find that we can predict the level income with reasonably high effectiveness (R^2 = `r )


Call:
glm(formula = ipc4 ~ . - Professional, family = "binomial", data = dat_ipc %>% 
    select(-ipc2))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0524   0.0011   0.1756   0.4267   3.7614  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   9.196189   0.239132  38.457  < 2e-16 ***
Hispanic      0.006449   0.002061   3.129  0.00176 ** 
White         0.034181   0.002088  16.371  < 2e-16 ***
Black         0.016145   0.002089   7.730 1.08e-14 ***
Asian         0.046923   0.002858  16.416  < 2e-16 ***
Service      -0.158933   0.002283 -69.614  < 2e-16 ***
Office       -0.092472   0.002650 -34.893  < 2e-16 ***
Construction -0.108381   0.002590 -41.851  < 2e-16 ***
Production   -0.141521   0.002123 -66.665  < 2e-16 ***
Unemployment -0.145545   0.002835 -51.345  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78240  on 69566  degrees of freedom
Residual deviance: 38592  on 69557  degrees of freedom
AIC: 38612

Number of Fisher Scoring iterations: 6
    Hispanic        White        Black        Asian      Service       Office 
   16.258238    25.125472    15.710318     2.872471     1.550756     1.503556 
Construction   Production Unemployment 
    1.443056     1.387141     1.127783 
Area under the curve: 0.8501

Ridge Regression

[1]  15 100

The dataset had 10 dependent variables predicting the independent variable. Ridge regression was introduced as it minimized the residual sum of squares and has a shrinkage penalty of lambda times by the sum of squares of the coefficients. Overall as lambda increases, the coefficients apprach zero. This plot indicates the entire path of variables as they shring towards zero. To build the ridge regression, a log sacle grid for the lambda values was constucted from 10^10 to 10^-2 in 100 segments.

Train and Test sets

To avoid introducing a bias in developing the Ridge and Lasso regression a train and test data set were introduced. To simulate a train and test set there was a random split into 50% for the train set.

[1] 824.8974
lowest lamda from CV:  824.8974 
MSE for best Ridge lamda:  12154666 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
  19143.7093    -300.1618     436.8876     -34.8267     267.4880    1313.3468 
     Service       Office Construction   Production Unemployment 
   -988.9918    -358.9968    -465.5783    -707.8713    -806.0590 

R^2: 
[1] 0.8843423

In order to obtain the best model for Ridge regression, cross validation was implimented to find the best fit. The cross validation line graph indicates that a model with ten dependent variables would yield the best lambda with the lowest mean square error. As the lambda value decreases, the mean square error also decreases. Overall, Ridge Regression includes all the of the dependent variables and the best value for lambda is indicated by the first vertical dotted line. The lowest lamda from the cross validation was found to be 825. The MSE for the best Ridge Lambda equation was 30834392. from the equation, the model that had the most positive coefficient valus were professional at 3248, White at 1104 and Asian at 546. The values that had the strongest negative coeffiecents were service at -2195, production at -2021 and unemployment at -1602. It was interesting to note that only professional had a positive lambda while the other work variables were all negative. The R^2 value for the best Ridge model was found to be 0.707. this means that 70.7% of the variation in the income can be explained by the model.

Lasso Regression

lowest lamda from CV:  25.78395 
 MSE for best Lasso lamda:  11672982 

All the coefficients : 
 (Intercept)     Hispanic        White        Black        Asian Professional 
17484.872000  -212.822777   219.903152     0.000000   165.286025  1987.274255 
     Service       Office Construction   Production Unemployment 
 -284.732831     8.180852     0.000000    -4.656966  -619.976123 

The non-zero coefficients : 
 (Intercept)     Hispanic        White        Asian Professional      Service 
17484.872000  -212.822777   219.903152   165.286025  1987.274255  -284.732831 
      Office   Production Unemployment 
    8.180852    -4.656966  -619.976123 
[1] 0.8889258

Lasso regression was also implimented to see if this model would perform differently from the OLS or Ridge model. Lasso regression can be useful in reducing over-fittness and assist in model selection. from the line plot it can be seen that the three most positive coefficient values are professional at 6030.2, white at 1690, and asian at 712.2. This means that Professional, White, and Asian have much stronger positive pull on the data that the other variables. The three most negtive coefficient values are unemployment at -1613, service at -716.5, and producton at -622.3. Construction was found to have a coefficient value of 0.0 so it was removed for the final Lasso model. It is interesting to note that the lambda values for hispanic are small at 13.6 so they do not deviate much from the ordinary least squares model (OLS). Cross validation was introduced to select the lambda value with the lowest MSE. The CV recommended eight dependent variables be used to predict income. The Lasso regresison recommended that construction be removed from the equation. the Cross validaiton value was found to be 16.2 and the MSE for the best Lasso model was 30709528. Also the r^2 value was found to be 0.708. this means that 70.8% of the variation in Income can be explained by the model.


Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)

Residuals:
     Min       1Q   Median       3Q      Max 
-27748.0  -1894.5    -20.2   1770.9  18988.4 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  17391.74      36.71  473.707  < 2e-16 ***
Hispanic       104.54      54.22    1.928   0.0539 .  
White          669.66      72.93    9.183  < 2e-16 ***
Black          334.20      52.11    6.414 1.43e-10 ***
Asian          326.28      25.82   12.635  < 2e-16 ***
Professional  1077.33    2706.77    0.398   0.6906    
Service       -814.52    1609.32   -0.506   0.6128    
Office        -358.87    1173.52   -0.306   0.7598    
Construction  -378.99    1193.57   -0.318   0.7508    
Production    -515.94    1505.74   -0.343   0.7319    
Unemployment  -616.00      17.07  -36.087  < 2e-16 ***
ipc2High     19892.01      62.25  319.559  < 2e-16 ***
ipc4Mid-Low   5181.26      42.82  120.998  < 2e-16 ***
ipc4Mid-High -9860.07      41.82 -235.757  < 2e-16 ***
ipc4High           NA         NA       NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3412 on 69553 degrees of freedom
Multiple R-squared:  0.8893,    Adjusted R-squared:  0.8892 
F-statistic: 4.296e+04 on 13 and 69553 DF,  p-value: < 2.2e-16

Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian + 
    Professional + Service + Office + Production + Unemployment, 
    data = datJLClean)

Residuals:
   Min     1Q Median     3Q    Max 
-57889  -3155   -139   3092  39315 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26167.82      20.93 1250.46   <2e-16 ***
Hispanic       972.98      87.56   11.11   <2e-16 ***
White         2983.19     117.35   25.42   <2e-16 ***
Black         1175.89      84.20   13.97   <2e-16 ***
Asian         1115.17      41.58   26.82   <2e-16 ***
Professional  5991.86      53.93  111.11   <2e-16 ***
Service       -737.90      39.77  -18.55   <2e-16 ***
Office         359.14      28.63   12.54   <2e-16 ***
Production    -667.56      40.62  -16.44   <2e-16 ***
Unemployment -1604.14      26.65  -60.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared:  0.7102,    Adjusted R-squared:  0.7101 
F-statistic: 1.894e+04 on 9 and 69557 DF,  p-value: < 2.2e-16

MSE for full model : 
[1] 11638291

MSE for full model (w/o construction) : 
[1] 30460435

An OLS model was consturcted for both the full model and the full model without the construction variable to compare them to the Ridge and Lasso models. the R^2 value for both the OLS models was found to be 0.71. this means that both the ordinary least squares models explain 71% of the variation in income can be explainedby the model. Furthermore the MSE for the full model was found to be 30459848. The full model withouth the construction variable was found to have a larger MSE at 30460435. Overall the Lasso, Ridge, and both OLS models explain aobut roughly the same amount of variability in the data. Also all of the R^2 values are about the same around 0.70. Since the full OLS has the lowest MSE and the highest R^2 it would be a more suitable option than the Ridge, Lasso, or OLS without construction.

Conclusion

Overall, this analysis found that there are several ways in which our independent variables reliably predict income in communities across the United States. Ethnicity and work type proportions had stronger predictive power, with the latter having the most powerful effects. However, these variables suffer from being largely non-normal, with a rightward skew, and from having high internal correlations, both between and within the two categories. Altogether, these variables allow us to predict income per capita at the census tract level with high reliability (R-squared = .67); this is actually quite impressive given the simplicity of this data. For instance, it does not directly include any information about the age or education of the population.

Lasso and Ridge regression were introduced to see if other forms of regression would be more effective than ordinary least squares models. Cross validations was used for both the lasso and ridge regression to find the best combinations of coefficients that would have the lowest MSE. Although the models cannot be directly compared as each use different penalizing factors, the R^2 value was used for comparing how much variation in income can be explained by the respective models. The R^2 values for the Ridge Lasso regression were 0.707 and 0.708. The full OLS model has a slightly higher R^2 value at 0.711. So all the models explain about roughly the data since the OLS is slightly higher, that would be the recommend choice.

Clustering:

K-means was utilizes to see how the data would group together so clusters with different IncomePerCaps may have specific ethnicity or work variation cluster means. Clusters were constructed for k=2,3,4, and 10. Overall for larger income groups, it appears to have higher averages for Professional. Also, as more clusters were added, a greater percentage of data variation was explained. Four clusters became the optimal choice as it explained 90.2 % while 3 clusters explained 83.3% and 10 clusters explained 98.2%. There wasn’t much of an increase from 4 to 10 and just enough increase from 3 to 4.

Classification:

Moving forward, this analysis allows for several expansions. The first is to integrate new data, such as age and education status of census tract residents. Additionally, it may be valuable to consider each of the individual freedom measures on its own, to negate the influence of high internal correlation. Finally, it is interesting if there are differences driven by geographic density, which can be estimated with just the currently accessible data.

Bibliography

MuonNeutrino. (2015). US Census Demographic Data: Demographic and Economic Data for Tracts and Counties. UpToDate. Retrieved March 23, 2020, from https://www.kaggle.com/muonneutrino/us-census-demographic-dataD

U.S. Census Bureau (2019). “Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. 2010-2019 Population Estimates. United States Census Bureau, Population Division. December 30, 2019. Retrieved January 27, 2020.

U.S. Census Bureau (2017). “American FactFinder - Results”. U.S. Census Bureau. Retrieved 2017-12-13.

U.S. Census Bureau (2013). “2010 Census Summary File 1: GEOGRAPHIC IDENTIFIERS”. American Factfinder. US Census. Retrieved 18 October 2013.